cherrypick: add lock wait time by daviszhen · Pull Request #24885 · matrixorigin/matrixone

daviszhen · 2026-06-08T09:24:15Z

What type of PR is this?

Which issue(s) this PR fixes:

What this PR does / why we need it:

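1. Enable the session-level variable lock_wait_timeout, which MySQL also provides. Users can set this value. The default for lock_wait_timeout is 31536000 (1 year), which is compatible with the current behavior. If the user sets a timeout, the user's value is used.
2. Add a lock RPC timeout. If lock_wait_timeout is set, the lock RPC timeout uses the lock_wait_timeout value. This avoids waiting on a lock for hours.
3. Modify lockoption to add a lockwaittimeout option.
4. Add slack to the RPC: RPC timeout = lock_wait_timeout + 30s, so that the server has enough time to return the timeout result.
5. The server-side async path now also enforces the timeout: previously the async path (the one remote locks take) never checked LockWaitTimeout; now waiterEvents.check() checks it periodically and notifies the waiter on timeout.
6. The client translates deadline errors: if the RPC deadline has passed but the caller context has not expired, this is a lock timeout rather than a connection problem, so the error is translated directly into ErrLockTimeout.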

Effect after the change:
Session 1 begins a transaction and takes a row lock.
Session 2 deletes the row, waits for the lock, and times out.

session1

MySQL [test]> select * from t1;
+------+
| a    |
+------+
|    2 |
|    3 |
|    4 |
|    1 |
+------+
4 rows in set (0.001 sec)

MySQL [test]>
MySQL [test]>
MySQL [test]> begin;
Query OK, 0 rows affected (0.000 sec)

MySQL [test]> select * from t1 where a = 1 for update;
+------+
| a    |
+------+
|    1 |
+------+
1 row in set (0.001 sec)

MySQL [test]>

session2

MySQL [test]> select @@session.lock_wait_timeout;
+---------------------+
| @@lock_wait_timeout |
+---------------------+
| 180                 |
+---------------------+
1 row in set (0.000 sec)

MySQL [test]> delete from t1 where a = 1;
ERROR 1105 (HY000): context deadline exceeded
MySQL [test]>

@iamlinjunhong

Approved by: @iamlinjunhong, @ouyuanning, @XuPeng-SH, @aunjgr, @fengttt

qodo-code-review · 2026-06-08T09:24:19Z

Qodo reviews are paused for this user.

aptend

LGTM. Reviewed the lock_wait_timeout propagation from session/txn options into lock options, local and remote lock wait handling, async waiter timeout path, and related tests. No blocking issues found.

XuPeng-SH

I think there is still a real multi-CN/session-semantics hole here. lock_wait_timeout is supposed to behave like a session variable, but on the remote-CN lock path it can get stuck at the value captured when the transaction was created.

The flow is:

- frontend copies lock_wait_timeout into TxnOptions only when the txn is created (pkg/frontend/txn.go:428-434);
- lockop.lockWaitTimeout() prefers the current process resolver, but falls back to TxnOptions when that resolver is unavailable (pkg/sql/colexec/lockop/lock_op.go:789-809);
- remote processes reconstructed from ProcessInfo do not carry a resolve-variable function or arbitrary session sysvars; only basic session info such as user/host/database/version/timezone is serialized (pkg/vm/process/process_codec.go:48-100).

So if a user does something like BEGIN; SET SESSION lock_wait_timeout = 1; ... and the later locking statement runs on a remote CN, the local path can see the new 1-second setting, but the remote path falls back to the stale txn-start value from TxnOptions (possibly the default 1 year). That means the same session-level change behaves differently depending on whether the lock is local or remote, which is exactly the kind of wait this PR is trying to fix.

I'd like to see the remote path get the current session value as well (or the feature/documentation explicitly narrowed to txn-start snapshot semantics), plus a test that changes lock_wait_timeout after BEGIN and then exercises a remote lock wait.

daviszhen · 2026-06-26T07:41:13Z

This has been fixed.

XuPeng-SH

Re-reviewed the latest head from the session-semantics and unhappy-path angle.

The previous blocker I raised looks addressed now:

- BuildProcessInfo now snapshots the current lock_wait_timeout into pipeline.SessionInfo.LockWaitTimeout, so remote CN processes decoded from ProcessInfo can see the current statement/session value instead of falling back to the txn-start TxnOptions value.
- lockop.lockWaitTimeout() now prefers the current resolver, then decoded SessionInfo.LockWaitTimeout, then the txn option fallback. That covers both local execution and remote-process execution.
- The remote async waiter path now enforces LockWaitTimeout with a precise timer instead of relying only on the coarse lazy check.
- Regression coverage includes the case I was worried about: BEGIN; SET SESSION lock_wait_timeout = 1; ... followed by a remote lock wait.

Local checks:

git diff --check origin/main...HEAD
go test ./pkg/lockservice -run 'TestLockWaitTimeout|TestRemoteLockWaitTimeout|TestRetryRemoteLockError' -count=1
go test ./pkg/txn/client -run 'TestWithTxnLockWaitTimeout' -count=1
git merge-tree --write-tree HEAD origin/main

Those passed. I also attempted the focused pkg/sql/colexec/lockop and pkg/vm/process tests, but this local environment cannot compile those packages because required CGO headers are missing (usearch.h, xxhash.h), so I am not treating that as a test failure. CI is green on the PR.

LGTM.

mergify · 2026-06-26T09:17:09Z

Queued — the merge queue status continues in this comment ↓.

mergify · 2026-06-26T09:17:40Z

Merge Queue Status

✅ Entered queue — 2026-06-26 09:17 UTC · Rule: main · triggered by rule Automatic queue on approval for main
❌ Checks failed · in-place
🚫 Left the queue — 2026-06-26 11:33 UTC · at 62733359eacb0fbb91758aef1ee26c02f57f14d9

This pull request spent 2 hours 16 minutes 18 seconds in the queue, with no time running CI.

Reason

The merge conditions cannot be satisfied due to failing checks

Failing checks:

Hint

You may have to fix your CI before adding the pull request to the queue again.
If you update this pull request, to fix the CI, it will automatically be requeued once the queue conditions match again.
If you think this was a flaky issue instead, you can requeue the pull request, without updating it, by posting a @mergifyio queue comment.

daviszhen requested review from XuPeng-SH, aptend, aunjgr, iamlinjunhong and ouyuanning as code owners June 8, 2026 09:24

daviszhen changed the title ~~cherrypick: add lock wait time (#24476)~~ cherrypick: add lock wait time Jun 8, 2026

mergify Bot added the kind/bug and kind/enhancement labels Jun 8, 2026

daviszhen temporarily deployed to ci June 8, 2026 09:25 — with GitHub Actions Inactive

daviszhen temporarily deployed to ci June 8, 2026 09:31 — with GitHub Actions Inactive

matrix-meow added the size/L Denotes a PR that changes [500,999] lines label Jun 8, 2026

aptend approved these changes Jun 8, 2026

aunjgr approved these changes Jun 8, 2026

XuPeng-SH requested changes Jun 12, 2026

update

6b3bc04

daviszhen had a problem deploying to ci June 24, 2026 12:42 — with GitHub Actions Failure

daviszhen temporarily deployed to ci June 24, 2026 12:42 — with GitHub Actions Inactive

daviszhen had a problem deploying to ci June 24, 2026 12:42 — with GitHub Actions Failure

daviszhen temporarily deployed to ci June 25, 2026 09:53 — with GitHub Actions Inactive

update

569cd5c

daviszhen had a problem deploying to ci June 26, 2026 03:36 — with GitHub Actions Error

daviszhen temporarily deployed to ci June 26, 2026 03:36 — with GitHub Actions Inactive

daviszhen had a problem deploying to ci June 26, 2026 03:36 — with GitHub Actions Error

daviszhen temporarily deployed to ci June 26, 2026 03:36 — with GitHub Actions Inactive

daviszhen had a problem deploying to ci June 26, 2026 03:36 — with GitHub Actions Failure

daviszhen temporarily deployed to ci June 26, 2026 03:36 — with GitHub Actions Inactive

update

cba72dd

daviszhen temporarily deployed to ci June 26, 2026 04:29 — with GitHub Actions Inactive

daviszhen added 3 commits June 26, 2026 14:16

update bvt

d949de0

update

1134357

update

91403c6

XuPeng-SH approved these changes Jun 26, 2026

Merge branch 'main' into 0608-pick3.0-to-main-58ad528

6ed76e5

heni02 approved these changes Jun 26, 2026

iamlinjunhong approved these changes Jun 26, 2026

Merge branch 'main' into 0608-pick3.0-to-main-58ad528

6273335

daviszhen wants to merge 15 commits into matrixorigin:main from daviszhen:0608-pick3.0-to-main-58ad528

7 participants
